1 Task

Spooky Author Identification

The competition dataset contains text from works of fiction by three spooky public-domain authors: Edgar Allan Poe, H. P. Lovecraft, and Mary Shelley. The data was prepared by chunking larger texts into sentences with CoreNLP's MaxEnt sentence tokenizer, so the odd non-sentence appears here and there. The objective is to identify the author of each sentence in the test set.

2 Preparation

2.1 Libraries used

# common
library('scales')
library('gridExtra')
library('corrplot')
library("ggcorrplot")
library('tidyverse')
library('reshape2')

# NLP
library('tidytext')
library('tm')
library('SnowballC')
library('topicmodels')
library('wordcloud')
library('qdap')
library('textstem')
library('udpipe')

# models
library('xgboost')
library('caret')
library('keras')
library('fastNaiveBayes')

2.2 Data loading

train <- read_csv('data/train.csv')
test <- read_csv('data/test.csv')
sample <- read_csv('data/sample_submission.csv')

3 Exploratory Data Analysis

3.1 Data structure

3.1.1 Train data

train %>%
    dplyr::slice(1:5)
id text author
id26305 This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. EAP
id17569 It never once occurred to me that the fumbling might be a mere mistake. HPL
id11008 In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. EAP
id27763 How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. MWS
id12958 Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. HPL
glimpse(train)
## Rows: 19,579
## Columns: 3
## $ id     <chr> "id26305", "id17569", "id11008", "id27763", "id12958", "id22965…
## $ text   <chr> "This process, however, afforded me no means of ascertaining th…
## $ author <chr> "EAP", "HPL", "EAP", "MWS", "HPL", "MWS", "EAP", "EAP", "EAP", …
summary(train)
##       id                text              author         
##  Length:19579       Length:19579       Length:19579      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
train %>%
    dplyr::slice(which.max(nchar(text)))
id text author
id27184 Diotima approached the fountain seated herself on a mossy mound near it and her disciples placed themselves on the grass near her Without noticing me who sat close under her she continued her discourse addressing as it happened one or other of her listeners but before I attempt to repeat her words I will describe the chief of these whom she appeared to wish principally to impress One was a woman of about years of age in the full enjoyment of the most exquisite beauty her golden hair floated in ringlets on her shoulders her hazle eyes were shaded by heavy lids and her mouth the lips apart seemed to breathe sensibility But she appeared thoughtful unhappy her cheek was pale she seemed as if accustomed to suffer and as if the lessons she now heard were the only words of wisdom to which she had ever listened The youth beside her had a far different aspect his form was emaciated nearly to a shadow his features were handsome but thin worn his eyes glistened as if animating the visage of decay his forehead was expansive but there was a doubt perplexity in his looks that seemed to say that although he had sought wisdom he had got entangled in some mysterious mazes from which he in vain endeavoured to extricate himself As Diotima spoke his colour went came with quick changes the flexible muscles of his countenance shewed every impression that his mind received he seemed one who in life had studied hard but whose feeble frame sunk beneath the weight of the mere exertion of life the spark of intelligence burned with uncommon strength within him but that of life seemed ever on the eve of fading At present I shall not describe any other of this groupe but with deep attention try to recall in my memory some of the words of Diotima they were words of fire but their path is faintly marked on my recollection It requires a just hand, said she continuing her discourse, to weigh divide the good from evil On the earth they are inextricably entangled and if you would cast away 
what there appears an evil a multitude of beneficial causes or effects cling to it mock your labour When I was on earth and have walked in a solitary country during the silence of night have beheld the multitude of stars, the soft radiance of the moon reflected on the sea, which was studded by lovely islands When I have felt the soft breeze steal across my cheek as the words of love it has soothed cherished me then my mind seemed almost to quit the body that confined it to the earth with a quick mental sense to mingle with the scene that I hardly saw I felt Then I have exclaimed, oh world how beautiful thou art Oh brightest universe behold thy worshiper spirit of beauty of sympathy which pervades all things, now lifts my soul as with wings, how have you animated the light the breezes Deep inexplicable spirit give me words to express my adoration; my mind is hurried away but with language I cannot tell how I feel thy loveliness Silence or the song of the nightingale the momentary apparition of some bird that flies quietly past all seems animated with thee more than all the deep sky studded with worlds” If the winds roared tore the sea and the dreadful lightnings seemed falling around me still love was mingled with the sacred terror I felt; the majesty of loveliness was deeply impressed on me So also I have felt when I have seen a lovely countenance or heard solemn music or the eloquence of divine wisdom flowing from the lips of one of its worshippers a lovely animal or even the graceful undulations of trees inanimate objects have excited in me the same deep feeling of love beauty; a feeling which while it made me alive eager to seek the cause animator of the scene, yet satisfied me by its very depth as if I had already found the solution to my enquires sic as if in feeling myself a part of the great whole I had found the truth secret of the universe But when retired in my cell I have studied contemplated the various motions and actions in the world the weight of 
evil has confounded me If I thought of the creation I saw an eternal chain of evil linked one to the other from the great whale who in the sea swallows destroys multitudes the smaller fish that live on him also torment him to madness to the cat whose pleasure it is to torment her prey I saw the whole creation filled with pain each creature seems to exist through the misery of another death havoc is the watchword of the animated world And Man also even in Athens the most civilized spot on the earth what a multitude of mean passions envy, malice a restless desire to depreciate all that was great and good did I see And in the dominions of the great being I saw man reduced? MWS
train %>%
    dplyr::slice(which.min(nchar(text)))
id text author
id20021 I breathed no longer. EAP
colSums(is.na(train))
##     id   text author 
##      0      0      0

3.1.2 Test data

test %>%
    dplyr::slice(1:5)
id text
id02310 Still, as I urged our leaving Ireland with such inquietude and impatience, my father thought it best to yield.
id24541 If a fire wanted fanning, it could readily be fanned with a newspaper, and as the government grew weaker, I have no doubt that leather and iron acquired durability in proportion, for, in a very short time, there was not a pair of bellows in all Rotterdam that ever stood in need of a stitch or required the assistance of a hammer.
id00134 And when they had broken down the frail door they found only this: two cleanly picked human skeletons on the earthen floor, and a number of singular beetles crawling in the shadowy corners.
id27757 While I was thinking how I should possibly manage without them, one actually tumbled out of my head, and, rolling down the steep side of the steeple, lodged in the rain gutter which ran along the eaves of the main building.
id04081 I am not sure to what limit his knowledge may extend.
glimpse(test)
## Rows: 8,392
## Columns: 2
## $ id   <chr> "id02310", "id24541", "id00134", "id27757", "id04081", "id27337",…
## $ text <chr> "Still, as I urged our leaving Ireland with such inquietude and i…
summary(test)
##       id                text          
##  Length:8392        Length:8392       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
test %>%
    dplyr::slice(which.max(nchar(text)))
id text
id20462 I gasped could not ask that which I longed to know the friendly spirit replied more gravely I have told you that you will not see those whom you mourn But I must away follow me or I must leave you weeping deserted by the spirit that now checks your tears Go I replied I cannot follow I can only sit here grieve long to see those who are gone for ever for to nought but what has relation to them can I listen The spirit left me to groan weep to wish the sun quenched in eternal darkness to accuse the air the waters all all the universe of my utter irremediable misery Fantasia came again and ever when she came tempted me to follow her but as to follow her was to leave for a while the thought of those loved ones whose memories were my all although they were my torment I dared not go Stay with me I cried help me to clothe my bitter thoughts in lovelier colours give me hope although fallacious images of what has been although it never will be again diversion I cannot take cruel fairy do you leave me alas all my joy fades at thy departure but I may not follow thee One day after one of these combats when the spirit had left me I wandered on along the banks of the river to try to disperse the excessive misery that I felt untill overcome by fatigue my eyes weighed down by tears I lay down under the shade of trees fell asleep I slept long and when I awoke I knew not where I was I did not see the river or the distant city but I lay beside a lovely fountain shadowed over by willows surrounded by blooming myrtles at a short distance the air seemed pierced by the spiry pines cypresses and the ground was covered by short moss sweet smelling heath the sky was blue but not dazzling like that of Rome and on every side I saw long allies clusters of trees with intervening lawns gently stealing rivers Where am I? 
I exclaimed looking around me I beheld Fantasia She smiled as she smiled all the enchanting scene appeared lovelier rainbows played in the fountain the heath flowers at our feet appeared as if just refreshed by dew I have seized you, said she as you slept and will for some little time retain you as my prisoner I will introduce you to some of the inhabitants of these peaceful Gardens It shall not be to any whose exuberant happiness will form an unpleasing contrast with your heavy grief but it shall be to those whose chief care here is to acquired knowledged sic virtue or to those who having just escaped from care pain have not yet recovered full sense of enjoyment This part of these Elysian Gardens is devoted to those who as before in your world wished to become wise virtuous by study action here endeavour after the same ends by contemplation They are still unknowing of their final destination but they have a clear knowledge of what on earth is only supposed by some which is that their happiness now hereafter depends upon their intellectual improvement Nor do they only study the forms of this universe but search deeply in their own minds and love to meet converse on all those high subjects of which the philosophers of Athens loved to treat With deep feelings but with no outward circumstances to excite their passions you will perhaps imagine that their life is uniform dull but these sages are of that disposition fitted to find wisdom in every thing in every lovely colour or form ideas that excite their love Besides many years are consumed before they arrive here When a soul longing for knowledge pining at its narrow conceptions escapes from your earth many spirits wait to receive it and to open its eyes to the mysteries of the universe many centuries are often consumed in these travels and they at last retire here to digest their knowledge to become still wiser by thought and imagination working upon memory When the fitting period is accomplished they leave this 
garden to inhabit another world fitted for the reception of beings almost infinitely wise but what this world is neither can you conceive or I teach you some of the spirits whom you will see here are yet unknowing in the secrets of nature They are those whom care sorrow have consumed on earth whose hearts although active in virtue have been shut through suffering from knowledge These spend sometime here to recover their equanimity to get a thirst of knowledge from converse with their wiser companions They now securely hope to see again those whom they love know that it is ignorance alone that detains them from them.
test %>%
    dplyr::slice(which.min(nchar(text)))
id text
id02295 Then he gave a start.
colSums(is.na(test))
##   id text 
##    0    0

3.2 Class distribution

table(train$author) %>%
    prop.table()
## 
##       EAP       HPL       MWS 
## 0.4034935 0.2878084 0.3086981
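These priors give a useful reference point: a model that always predicts the training class proportions scores a fixed multiclass log loss, which any real model should beat. A minimal base-R sketch using the proportions above (assuming the test set is distributed like the training set, the expected loss is the entropy of the class distribution):

```r
# Constant-prior baseline: always predict the training class proportions.
# Over a test set distributed like the training set, its multiclass log loss
# is -sum(p * log(p)), i.e. the entropy of the class distribution.
priors <- c(EAP = 0.4034935, HPL = 0.2878084, MWS = 0.3086981)

baseline_logloss <- -sum(priors * log(priors))
round(baseline_logloss, 4)
# → 1.0875 -- any useful model must score below this
```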
train %>% 
    group_by(author) %>%
    summarise(n = n()) %>% 
    ggplot(data = ., aes(x = author, y = n, fill = author)) +
    geom_col(show.legend = F) +
    scale_fill_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    xlab(label = 'author') +
    ylab(label = 'number of texts') +
    theme_bw()

train %>% 
    mutate(len = nchar(text)) %>% 
    ggplot(data = ., aes(x = len, fill = author)) +
    geom_histogram(binwidth = 50) +
    scale_fill_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    facet_grid(. ~ author) +
    xlab(label = 'number of characters') +
    ylab(label = 'number of texts') +
    theme_bw()

train %>% 
    mutate(len = nchar(text)) %>% 
    group_by(author) %>% 
    summarise(`median text length` = median(len), `mean text length` = mean(len))
author median text length mean text length
EAP 115 142.2259
HPL 142 155.8435
MWS 130 151.6598

3.3 Wordclouds

train %>%
    unnest_tokens(word, text) %>% 
    count(word) %>% 
    with(wordcloud(word, n, max.words = 50, scale = c(3, 1.5), random.order = F, rot.per = 0, color = 'black'))

train_clean <- train %>% 
    unnest_tokens(word, text) %>% 
    anti_join(stop_words, by = 'word')

train_clean %>%
    count(word) %>% 
    with(wordcloud(word, n, max.words = 50, scale = c(3, 1), random.order = F, rot.per = 0, color = 'black'))
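Conceptually, unnest_tokens followed by anti_join(stop_words) lowercases each sentence, splits it into word tokens, and drops function words. A dependency-free base-R sketch with a toy stop-word list (the real stop_words table from tidytext is much larger):

```r
# Toy stop-word list -- illustration only, not tidytext::stop_words
toy_stop_words <- c("the", "of", "and", "a", "to", "in", "i", "it")

tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))  # lowercase, split on non-letters
  tokens[nchar(tokens) > 0]                              # drop empty strings
}

tokens <- tokenize("I breathed no longer.")
kept   <- tokens[!tokens %in% toy_stop_words]
kept
# → "breathed" "no" "longer"
```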

3.3.1 Lovecraft

train_clean %>%
    filter(author == 'HPL') %>% 
    count(word) %>% 
    with(wordcloud(word, n, max.words = 50, scale = c(3, 1), random.order = F, rot.per = 0, color = 'blue4'))

3.3.2 Poe

train_clean %>%
    filter(author == 'EAP') %>% 
    count(word) %>% 
    with(wordcloud(word, n, max.words = 50, scale = c(3, 1), random.order = F, rot.per = 0, color = 'red4'))

3.3.3 Shelley

train_clean %>%
    filter(author == 'MWS') %>% 
    count(word) %>% 
    with(wordcloud(word, n, max.words = 50, scale = c(3, 1), random.order = F, rot.per = 0, color = 'purple4'))

3.4 Keywords

get_bar_plot <- function(author_id, clr, stem = F){
    
    if(stem){
        t <- train_clean %>%
            filter(author == author_id) %>% 
            mutate(word = wordStem(word))
    }else{
        t <- train_clean %>%
            filter(author == author_id)
    }
    
    t %>% 
        count(word) %>% 
        top_n(30, n) %>%
        arrange(n) %>%
        mutate(word = factor(word, levels = word)) %>%
        ggplot() +
        geom_col(aes(word, n), fill = clr) +
        ggtitle(author_id) +
        coord_flip()
}

3.4.1 Plain keywords

pl1 <- get_bar_plot(author_id = 'HPL', clr = 'blue4')
pl2 <- get_bar_plot(author_id = 'EAP', clr = 'red4')
pl3 <- get_bar_plot(author_id = 'MWS', clr = 'purple4')

grid.arrange(pl1, pl2, pl3, nrow = 1)

3.4.2 Keywords + stemming

pl1 <- get_bar_plot(author_id = 'HPL', clr = 'blue4', stem = T)
pl2 <- get_bar_plot(author_id = 'EAP', clr = 'red4', stem = T)
pl3 <- get_bar_plot(author_id = 'MWS', clr = 'purple4', stem = T)

grid.arrange(pl1, pl2, pl3, nrow = 1)
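wordStem applies the Porter stemmer, so inflected variants of a word collapse into a single bar. A deliberately crude base-R illustration of the idea (simple suffix stripping, not the Porter algorithm or SnowballC::wordStem):

```r
words <- c("dreaming", "dreams", "dreamed", "dream")

# crude suffix stripping -- illustrates stemming, not a real stemmer
crude_stem <- function(w) sub("(ing|ed|s)$", "", w)

table(crude_stem(words))
# → dream: 4  (all four variants collapse to one stem)
```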

3.5 Unique words

train_clean %>%
    distinct(author, word) %>% 
    count(author) %>% 
    rename(`unique word count` = n)
author unique word count
EAP 14856
HPL 14188
MWS 11115
get_uniq_words_plot <- function(author_id){
    
    t <- train_clean %>% 
        filter(author == author_id) %>% 
        count(word) %>% 
        arrange(desc(n))
    
    t %>% 
        mutate(rank = row_number(),
               cumsum = cumsum(n),
               cumsum_perc = round(100 * cumsum/sum(n), digits = 2)) %>% 
        ggplot(aes(x = rank, y = cumsum_perc)) +
        geom_line() +
        geom_hline(yintercept = 50, color = 'black', alpha = 0.5) +
        geom_hline(yintercept = 75, color = 'yellow', alpha = 0.5) +
        geom_hline(yintercept = 90, color = 'orange', alpha = 0.5) +
        geom_hline(yintercept = 95, color = 'red', alpha = 0.5) +
        xlab('number of unique words') +
        ylab('% coverage') +
        ggtitle(author_id) +
        theme_bw()
        
}

pl1 <- get_uniq_words_plot(author_id = 'HPL')
pl2 <- get_uniq_words_plot(author_id = 'EAP')
pl3 <- get_uniq_words_plot(author_id = 'MWS')

grid.arrange(pl1, pl2, pl3, nrow = 1)
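The curves above are just cumulative sums of the sorted word counts, expressed as a percentage of all tokens. A toy base-R sketch with hypothetical frequencies:

```r
# Hypothetical word counts for one author, most frequent first
n <- c(sea = 50, night = 30, raven = 10, dungeon = 6, pendulum = 4)

cum_perc <- round(100 * cumsum(n) / sum(n), 2)
cum_perc
# → sea 50, night 80, raven 90, dungeon 96, pendulum 100
```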

3.6 Word co-occurrence

word_freq <- train_clean %>%
    count(author, word) %>%
    group_by(author)  %>%
    mutate(prop = n / sum(n)) %>% 
    select(-n) %>% 
    spread(author, prop)

get_word_freq_plot <- function(author_id1, author_id2){
    word_freq %>% 
        filter(!is.na(!!author_id1) & !is.na(!!author_id2)) %>% 
        mutate(clr = abs(!!author_id1 - !!author_id2)) %>%
        ggplot(aes(x = !!author_id1, y = !!author_id2, color = clr)) +
        geom_abline(color = 'gray40', lty = 2) +
        geom_jitter(alpha = 0.1, size = 2, width = 0.3, height = 0.3) +
        geom_text(aes(label = word), check_overlap = T, vjust = 1.5) +
        scale_x_log10(labels = percent_format()) +
        scale_y_log10(labels = percent_format()) +
        theme_bw() +
        theme(legend.position = 'none') +
        labs(x = author_id1, y = author_id2)
}

3.6.1 Lovecraft & Poe

get_word_freq_plot(author_id1 = quo(HPL), author_id2 = quo(EAP))

3.6.2 Lovecraft & Shelley

get_word_freq_plot(author_id1 = quo(HPL), author_id2 = quo(MWS))

3.6.3 Poe & Shelley

get_word_freq_plot(author_id1 = quo(EAP), author_id2 = quo(MWS))

3.7 Correlation

word_freq %>%
    select(-word) %>%
    cor(use = 'complete.obs', method = 'pearson') %>%
    corrplot(type = 'lower',
             method = 'number',
             diag = F)
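The correlation above is an ordinary Pearson correlation between the per-author word-proportion columns, computed over the words all authors share (complete.obs). A toy two-author sketch with made-up proportions:

```r
# Hypothetical relative frequencies of four words for two authors
freq_a <- c(night = 0.010, sea = 0.008, love = 0.001, heart = 0.002)
freq_b <- c(night = 0.004, sea = 0.002, love = 0.009, heart = 0.008)

r <- cor(freq_a, freq_b, method = "pearson")
round(r, 2)
# → -0.91 -- these two toy profiles are strongly anti-correlated
```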

3.8 TF-IDF

tf_idf <- train_clean %>%
    count(author, word) %>% 
    bind_tf_idf(word, author, n)
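bind_tf_idf treats each author as one "document" and computes tf = n / (terms in document), idf = ln(documents / documents containing the term), and tf_idf = tf × idf. A base-R sketch on two toy documents, showing why words shared by everyone get zero weight:

```r
docs <- list(A = c("sea", "sea", "night"), B = c("night", "love"))

# idf = ln(number of documents / number of documents containing the term)
idf <- function(term) log(length(docs) / sum(sapply(docs, function(d) term %in% d)))

tf_sea_A       <- sum(docs$A == "sea") / length(docs$A)  # 2/3
tf_idf_sea     <- tf_sea_A * idf("sea")    # only A uses "sea" -> positive weight
tf_idf_night_A <- (1/3) * idf("night")     # both documents use "night" -> idf = 0

round(c(sea = tf_idf_sea, night = tf_idf_night_A), 4)
# → sea 0.4621, night 0
```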

get_tf_idf_plot <- function(df, cnt){
    df %>%
        arrange(desc(tf_idf)) %>%
        mutate(word = factor(word, levels = rev(unique(word)))) %>%
        top_n(cnt, tf_idf) %>%
        ggplot(aes(word, tf_idf, fill = author)) +
        scale_fill_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
        geom_col() +
        labs(x = NULL, y = 'TF-IDF values') +
        theme_bw() +
        theme(legend.position = 'top') +
        coord_flip()
}

get_tf_idf_facet_plot <- function(df, cnt){
    df %>%
        arrange(desc(tf_idf)) %>%
        mutate(word = factor(word, levels = rev(unique(word)))) %>%
        group_by(author) %>%
        top_n(cnt, tf_idf) %>%
        ungroup() %>% 
        ggplot(aes(word, tf_idf, fill = author)) +
        scale_fill_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
        geom_col() +
        labs(x = NULL, y = 'TF-IDF values') +
        theme_bw() +
        theme(legend.position = 'none') +
        facet_wrap(~author, ncol = 3, scales = 'free_y') +
        coord_flip()
}

3.8.1 Common unigrams

get_tf_idf_plot(tf_idf, 20)

3.8.2 Personal unigrams

get_tf_idf_facet_plot(tf_idf, 20)

3.9 Bigrams

train_clean_bigram <- train %>% 
    unnest_tokens(word, text, token = 'ngrams', n = 2) %>% 
    filter(!is.na(word)) %>%   # texts shorter than two words yield NA n-grams
    separate(word, c('word1', 'word2'), sep = ' ') %>% 
    anti_join(stop_words, by = c('word1' = 'word')) %>% 
    anti_join(stop_words, by = c('word2' = 'word')) %>% 
    unite(word, c('word1', 'word2'), sep = ' ')
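unnest_tokens with token = 'ngrams' and n = 2 emits every consecutive word pair; pairs with a stop word on either side are then dropped. A base-R sketch of both steps on toy tokens (the stop-word list is illustrative):

```r
tokens <- c("ancient", "cyclopean", "ruins", "of", "kadath")

# consecutive word pairs: element i with element i + 1
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
# → "ancient cyclopean" "cyclopean ruins" "ruins of" "of kadath"

# drop bigrams whose first or second word is a (toy) stop word
toy_stop_words <- c("of", "the")
keep <- !sapply(strsplit(bigrams, " "), function(p) any(p %in% toy_stop_words))
bigrams[keep]
# → "ancient cyclopean" "cyclopean ruins"
```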

tf_idf_bigram <- train_clean_bigram %>%
    count(author, word) %>% 
    bind_tf_idf(word, author, n)

3.9.1 Common bigrams

get_tf_idf_plot(tf_idf_bigram, 20)

3.9.2 Personal bigrams

get_tf_idf_facet_plot(tf_idf_bigram, 15)

3.10 Trigrams

train_clean_trigram <- train %>% 
    unnest_tokens(word, text, token = 'ngrams', n = 3) %>% 
    filter(!is.na(word)) %>%   # texts shorter than three words yield NA n-grams
    separate(word, c('word1', 'word2', 'word3'), sep = ' ') %>% 
    anti_join(stop_words, by = c('word1' = 'word')) %>% 
    anti_join(stop_words, by = c('word2' = 'word')) %>% 
    anti_join(stop_words, by = c('word3' = 'word')) %>% 
    unite(word, c('word1', 'word2', 'word3'), sep = ' ')

tf_idf_trigram <- train_clean_trigram %>%
    count(author, word) %>% 
    bind_tf_idf(word, author, n)

3.10.1 Common trigrams

get_tf_idf_plot(tf_idf_trigram, 10)

3.10.2 Personal trigrams

get_tf_idf_facet_plot(tf_idf_trigram, 6)

3.11 LDA

3.11.1 3 topics

train_dtm <-  train %>% 
    mutate(text = as.character(text), id = as.character(id), id = paste(author, id, sep="_")) %>% 
    unnest_tokens(word, text) %>% 
    anti_join(stop_words, by = 'word') %>% 
    count(id, word) %>% 
    cast_dtm(id, word, n)
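cast_dtm pivots the (document, word, n) counts into a sparse document-term matrix: one row per sentence, one column per word. A dense base-R analogue using table() on toy data:

```r
doc  <- c("d1", "d1", "d1", "d2", "d2")
word <- c("sea", "sea", "night", "night", "love")

dtm <- table(doc, word)  # dense analogue of the sparse DocumentTermMatrix
dtm
#      word
# doc   love night sea
#   d1     0     1   2
#   d2     1     1   0
```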

train_lda <- LDA(train_dtm, k = 3, control = list(seed = 42))

train_topics <- tidy(train_lda, matrix = 'beta')

train_topics %>%
    group_by(topic) %>%
    top_n(20, beta) %>%
    ungroup() %>%
    arrange(topic, -beta) %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = F) +
    facet_wrap(~ topic, scales = 'free', ncol = 5) +
    coord_flip() +
    theme_bw()

gamma_extractor <- function(lda_model) {
  tidy(lda_model, matrix = 'gamma') %>%
    spread(topic, gamma) %>%
    separate(document, c('author', 'id'), sep = '_', convert = TRUE) %>%
    mutate(top = NA, prediction = NA)  # filled in per document below
}

gamma_lda <- gamma_extractor(train_lda)
length_lda <- nrow(gamma_lda)

for (i in 1:length_lda) {
  gammas <- unlist(gamma_lda[i, 3:5])           # topic probabilities for row i
  gamma_lda$top[i]        <- max(gammas)        # highest gamma
  gamma_lda$prediction[i] <- which.max(gammas)  # index of the winning topic
}
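The per-row loop above assigns each pseudo-document its most probable topic. For larger k this can be vectorised; a base-R sketch with a toy gamma matrix (rows = documents, columns = topics):

```r
gamma <- matrix(c(0.1, 0.7, 0.2,
                  0.5, 0.2, 0.3),
                nrow = 2, byrow = TRUE)

prediction <- max.col(gamma)                                  # winning topic per row
top        <- gamma[cbind(seq_len(nrow(gamma)), prediction)]  # its probability
prediction
# → 2 1
top
# → 0.7 0.5
```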

table(gamma_lda$author, gamma_lda$prediction)
##      
##          1    2    3
##   EAP 2608 2495 2736
##   HPL 1802 1975 1832
##   MWS 2115 1932 1972

3.11.2 9 topics

train_lda <- LDA(train_dtm, k = 9, control = list(seed = 42))

train_topics <- tidy(train_lda, matrix = 'beta')

train_topics %>%
    group_by(topic) %>%
    top_n(7, beta) %>%
    ungroup() %>%
    arrange(topic, -beta) %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = F) +
    facet_wrap(~ topic, scales = 'free', ncol = 3) +
    coord_flip() +
    theme_bw()

gamma_lda <- gamma_extractor(train_lda)
length_lda <- nrow(gamma_lda)

for (i in 1:length_lda) {
  gammas <- unlist(gamma_lda[i, 3:11])          # topic probabilities for row i
  gamma_lda$top[i]        <- max(gammas)        # highest gamma
  gamma_lda$prediction[i] <- which.max(gammas)  # index of the winning topic
}

table(gamma_lda$author, gamma_lda$prediction)
##      
##          1    2    3    4    5    6    7    8    9
##   EAP  935  886  953  849  699 1019  817  938  743
##   HPL  550  700  603  687  600  611  592  602  664
##   MWS  604  622  527  596  813  530  907  623  797

4 Feature engineering

4.1 Stylometric features

train_stylometry <- train %>% 
    mutate(ascii_text = iconv(text, to='ASCII//TRANSLIT'),
           ascii_text = ifelse(is.na(ascii_text), text, ascii_text),
           ascii_text = str_replace_all(ascii_text, c("ΥΠΝΟΣ" = "Hypnos", "Οἶδα Οἶδα" = "Oida Oida")),
           word_count = str_count(text, '\\w+'),
           nchar = nchar(text),
           nchar_per_word = nchar/word_count,
           syll_count = syllable_sum(ascii_text),
           nsyll_per_word = syll_count/word_count,
           tc_count = str_count(text, '[A-Z][a-z]+'),
           uc_count = str_count(text, '[A-Z][A-Z]+'),
           punctuation_count = str_count(text, '[:punct:]')) %>% 
    select(-c(text, ascii_text, author))

test_stylometry <- test %>% 
    mutate(ascii_text = iconv(text, to='ASCII//TRANSLIT'),
           ascii_text = ifelse(is.na(ascii_text), text, ascii_text),
           word_count = str_count(text, '\\w+'),
           nchar = nchar(text),
           nchar_per_word = nchar/word_count,
           syll_count = syllable_sum(ascii_text),
           nsyll_per_word = syll_count/word_count,
           tc_count = str_count(text, '[A-Z][a-z]+'),
           uc_count = str_count(text, '[A-Z][A-Z]+'),
           punctuation_count = str_count(text, '[:punct:]')) %>% 
    select(-c(text, ascii_text))
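Each stylometric column is a simple pattern count over the raw sentence. A base-R sketch of the str_count equivalents for one toy sentence:

```r
txt <- "I breathed no longer."

# count non-overlapping regex matches, like stringr::str_count
count_matches <- function(text, pattern) {
  m <- gregexpr(pattern, text)[[1]]
  if (m[1] == -1) 0L else length(m)
}

word_count        <- count_matches(txt, "\\w+")         # 4 words
punctuation_count <- count_matches(txt, "[[:punct:]]")  # 1 (the full stop)
tc_count          <- count_matches(txt, "[A-Z][a-z]+")  # 0: "I" alone does not match
nchar_per_word    <- nchar(txt) / word_count            # 21 / 4 = 5.25
```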
train_tmp <- train %>%
    unnest_tokens(term, text, token = 'ngrams', n=1) %>%
    mutate(stop_word_ind = as.integer(term %in% stop_words$word),
           ascii_term = str_replace_all(term, c("υπνος" = "hypnos", "οἶδα" = "oida")),
           ascii_term = iconv(ascii_term, to='ASCII//TRANSLIT'),
           ascii_term = ifelse(is.na(ascii_term), term, ascii_term),
           syll_count = syllable_sum(ascii_term, parallel = T),
           word_length = nchar(term)) %>%
    group_by(id) %>%
    summarise(nStopWord = sum(stop_word_ind),
              nUniqueWord = n_distinct(term),
              avg_word_length = mean(word_length),
              ngt6lWord = sum(word_length > 6),
              n1SylWord = sum(syll_count == 1),
              nPolySylWord = sum(syll_count > 2)
              )

test_tmp <- test %>%
    unnest_tokens(term, text, token = 'ngrams', n=1) %>%
    mutate(stop_word_ind = as.integer(term %in% stop_words$word),
           ascii_term = str_replace_all(term, c("υπνος" = "hypnos", "οἶδα" = "oida")),
           ascii_term = iconv(ascii_term, to='ASCII//TRANSLIT'),
           ascii_term = ifelse(is.na(ascii_term), term, ascii_term),
           syll_count = syllable_sum(ascii_term, parallel = T),
           word_length = nchar(term)) %>%
    group_by(id) %>%
    summarise(nStopWord = sum(stop_word_ind),
              nUniqueWord = n_distinct(term),
              avg_word_length = mean(word_length),
              ngt6lWord = sum(word_length > 6),
              n1SylWord = sum(syll_count == 1),
              nPolySylWord = sum(syll_count > 2)
              )
train_stylometry <- train_stylometry %>% 
    left_join(train_tmp, by = 'id') %>%
    mutate(unique_r = nUniqueWord/word_count,
           w_p = word_count - punctuation_count,
           w_p_r = w_p/word_count,
           stop_r = nStopWord/word_count,
           w_p_stop = w_p - nStopWord,
           w_p_stop_r = w_p_stop/word_count,
           num_words_upper_r = uc_count/word_count,
           num_words_title_r = tc_count/word_count)

test_stylometry <- test_stylometry %>% 
    left_join(test_tmp, by = 'id') %>%
    mutate(unique_r = nUniqueWord/word_count,
           w_p = word_count - punctuation_count,
           w_p_r = w_p/word_count,
           stop_r = nStopWord/word_count,
           w_p_stop = w_p - nStopWord,
           w_p_stop_r = w_p_stop/word_count,
           num_words_upper_r = uc_count/word_count,
           num_words_title_r = tc_count/word_count)
train  %>%
    left_join(train_stylometry, by = 'id') %>%
    select(-c(text, id)) %>%
    filter(nchar < 1000) %>%
    group_by(author) %>%
    summarise_all(mean) %>%
    gather(feature, value, -author) %>%
    ggplot(aes(x = feature, y = value, color = author)) +
    scale_color_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    geom_point(alpha = 0.6, size = 3) +
    scale_y_log10() +
    labs(x = 'feature', y = 'mean value by author') +
    coord_flip() +
    theme_bw()

4.2 Tonality-based features

4.2.1 Verb tonality and intensifiers

thought_verbs <- c('analyze', 'apprehend', 'assume', 'believe', 'calculate', 'cerebrate', 'cogitate',
                   'comprehend', 'conceive', 'concentrate', 'conceptualize', 'conclude', 'consider',
                   'construe', 'contemplate', 'deduce', 'deem', 'deliberate', 'desire', 'diagnose',
                   'doubt', 'envisage', 'envision', 'evaluate', 'excogitate', 'extrapolate', 'fantasize',
                   'forget', 'forgive', 'formulate', 'hate', 'hypothesize', 'imagine', 'infer', 
                   'intellectualize', 'intrigue', 'guess', 'introspect', 'judge', 'know', 'love', 
                   'lucubrate', 'marvel', 'meditate', 'note', 'notice', 'opine', 'perpend', 'philosophize',
                   'ponder', 'question', 'ratiocinate', 'rationalize', 'realize', 'reason', 'recollect', 
                   'reflect', 'remember', 'reminisce', 'retrospect', 'ruminate', 'sense', 'speculate',
                   'stew', 'strategize', 'suppose', 'suspect', 'syllogize', 'theorize', 'think', 
                   'understand', 'visualize', 'want', 'weigh', 'wonder')

loud_verbs <- c('cry', 'exclaim', 'shout', 'roar', 'scream', 'shriek', 'vociferate', 'bawl',
                'call', 'ejaculate', 'retort', 'proclaim', 'announce', 'protest', 'accost', 'declare')

neutral_verbs <- c('say', 'reply', 'observe', 'rejoin', 'ask', 'answer', 'return', 'repeat', 'remark',
                   'enquire', 'respond', 'suggest', 'explain', 'utter', 'mention')

quiet_verbs <- c('whisper', 'murmur', 'sigh', 'grumble', 'mumble', 'mutter', 'whimper', 'hush', 'falter',
                 'stammer', 'tremble', 'gasp', 'shudder')

qualifiers <- c('very', 'too', 'so', 'quite', 'rather', 'little', 'pretty', 'somewhat', 'various', 'almost', 
                'much', 'just', 'indeed', 'still', 'even', 'a lot', 'kind of', 'sort of')
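Each tonality feature counts how many (lemmatised) tokens fall into one of these hand-picked lists. A base-R sketch with a deliberately crude lemmatiser, a stand-in for textstem::lemmatize_words used purely for illustration:

```r
quiet_verbs <- c("whisper", "murmur", "sigh", "gasp")

tokens <- c("she", "whispered", "and", "sighed")

# crude stand-in for lemmatize_words(): strip a trailing "-ed"
lemmas <- sub("ed$", "", tokens)

quiet_verbs_count <- sum(lemmas %in% quiet_verbs)
quiet_verbs_count
# → 2  ("whispered" and "sighed" reduce to list entries)
```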

train_tmp <- train %>%
    unnest_tokens(term, text, token = 'ngrams', n=1) %>% 
    bind_rows(train %>% unnest_tokens(term, text, token = 'ngrams', n=2)) %>%
    mutate(term = lemmatize_words(term),
           qualifier_ind = as.integer(term %in% qualifiers),
           thought_verbs_ind = as.integer(term %in% thought_verbs),
           loud_verbs_ind = as.integer(term %in% loud_verbs),
           neutral_verbs_ind = as.integer(term %in% neutral_verbs),
           quiet_verbs_ind = as.integer(term %in% quiet_verbs)) %>%
    group_by(id) %>%
    summarise(qualifier_count = sum(qualifier_ind),
              thought_verbs_count = sum(thought_verbs_ind),
              loud_verbs_count = sum(loud_verbs_ind),
              neutral_verbs_count = sum(neutral_verbs_ind),
              quiet_verbs_count = sum(quiet_verbs_ind))

test_tmp <- test %>%
    unnest_tokens(term, text, token = 'ngrams', n=1) %>% 
    bind_rows(test %>% unnest_tokens(term, text, token = 'ngrams', n=2)) %>%
    mutate(term = lemmatize_words(term),
           qualifier_ind = as.integer(term %in% qualifiers),
           thought_verbs_ind = as.integer(term %in% thought_verbs),
           loud_verbs_ind = as.integer(term %in% loud_verbs),
           neutral_verbs_ind = as.integer(term %in% neutral_verbs),
           quiet_verbs_ind = as.integer(term %in% quiet_verbs)) %>%
    group_by(id) %>%
    summarise(qualifier_count = sum(qualifier_ind),
              thought_verbs_count = sum(thought_verbs_ind),
              loud_verbs_count = sum(loud_verbs_ind),
              neutral_verbs_count = sum(neutral_verbs_ind),
              quiet_verbs_count = sum(quiet_verbs_ind))
 
train_stylometry <- train_stylometry %>% 
    left_join(train_tmp, by = 'id')

test_stylometry <- test_stylometry %>% 
    left_join(test_tmp, by = 'id')
train  %>%
    left_join(train_tmp, by = 'id') %>%
    select(-c(text, id)) %>%
    group_by(author) %>%
    summarise_all(mean) %>%
    gather(feature, value, -author) %>%
    ggplot(aes(x = feature, y = value, color = author)) +
    scale_color_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    geom_point(alpha = 0.6, size = 3) +
    scale_y_log10() +
    labs(x = 'feature', y = 'mean value by author') +
    coord_flip() +
    theme_bw()

4.2.2 Sentiment analysis

train_senti <- train %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments('nrc'), by = 'word') %>%
    count(id, sentiment) %>%
    spread(sentiment, n, sep = '_', fill = 0)

test_senti <- test %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments('nrc'), by = 'word') %>%
    count(id, sentiment) %>%
    spread(sentiment, n, sep = '_', fill = 0)

afinn_sentiment <- train %>%
    select(-author) %>%
    bind_rows(test) %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments('afinn'), by = 'word') %>%
    group_by(id) %>%
    summarise(sentiment_afinn = mean(value))

train_senti <- train_senti %>%
    left_join(afinn_sentiment, by = 'id') %>%
    replace_na(list(sentiment_afinn = 0))

test_senti <- test_senti %>%
    left_join(afinn_sentiment, by = 'id') %>%
    replace_na(list(sentiment_afinn = 0))
train %>%
    right_join(train_senti, by = 'id') %>%
    select(-c(text, id)) %>%
    group_by(author) %>%
    summarise_all(mean) %>%
    gather(feature, value, -author) %>%
    ggplot(aes(x = feature, y = value, color = author)) +
    scale_color_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    geom_point(alpha = 0.6, size = 3) +
    labs(x = 'feature', y = 'mean value by author') +
    coord_flip() +
    theme_bw()

4.3 POS-tagging

train <- train %>% 
    mutate(clean_text = str_replace(text, '^, ', ''),
           clean_text = str_replace(clean_text, '^\\.\" ', ''),
           clean_text = str_replace_all(clean_text, '\"', ''))

test <- test %>% 
    mutate(clean_text = str_replace_all(text, '\"', ''))
udmodel <- udpipe_download_model(language = 'english')
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(object = udmodel, x = train$clean_text,  doc_id = train$id)
x <- as.data.frame(x)
y <- udpipe_annotate(object = udmodel, x = test$clean_text,  doc_id = test$id)
y <- as.data.frame(y)
train_udp_pos_count <- x %>%
    count(doc_id, xpos) %>%
    spread(xpos, n, fill = 0) %>%
    rename(id = doc_id)

test_udp_pos_count <- y %>%
    count(doc_id, xpos) %>%
    spread(xpos, n, fill = 0) %>% 
    select(-c(`''`, `,`, `.`, `:`, '``')) %>%
    rename(id = doc_id)

train_udp_pos_count <- train_udp_pos_count %>% 
    select(intersect(colnames(.), colnames(test_udp_pos_count)))

train_udp_upos_count <- x %>%
    count(doc_id, upos) %>%
    spread(upos, n, fill = 0) %>% 
    select(-PUNCT) %>%
    rename(id = doc_id)

test_udp_upos_count <- y %>%
    count(doc_id, upos) %>%
    spread(upos, n, fill = 0) %>% 
    select(-PUNCT) %>%
    rename(id = doc_id)

train_pos_tagged <- x
test_pos_tagged <- y

4.3.1 Language-specific POS tags

train %>%  
    left_join(train_udp_pos_count, by = 'id') %>%
    select(-c(text, id, clean_text)) %>%
    group_by(author) %>%
    summarise_all(mean) %>%
    gather(feature, value, -author) %>%
    ggplot(aes(x = feature, y = value, color = author)) +
    scale_color_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    geom_point(alpha = 0.6, size = 3) +
    labs(x = 'feature', y = 'mean value by author') +
    coord_flip() +
    theme_bw()

4.3.2 Universal POS tags

train %>%  
    left_join(train_udp_upos_count, by = 'id') %>%
    select(-c(text, id, clean_text)) %>%
    group_by(author) %>%
    summarise_all(mean) %>%
    gather(feature, value, -author) %>%
    ggplot(aes(x = feature, y = value, color = author)) +
    scale_color_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    geom_point(alpha = 0.6, size = 3) +
    labs(x = 'feature', y = 'mean value by author') +
    coord_flip() +
    theme_bw()

4.4 N-gram-based features

author_unigrams_tfidf <- train %>%
    unnest_tokens(term, text, token = 'ngrams', n=1) %>%
    count(author, term) %>%
    bind_tf_idf(term, author, n)

author_bigrams_tfidf <- train %>%
    unnest_tokens(term, text, token = 'ngrams', n=2) %>%
    count(author, term) %>%
    bind_tf_idf(term, author, n)

author_trigrams_tfidf <- train %>%
    unnest_tokens(term, text, token = 'ngrams', n=3) %>%
    count(author, term) %>%
    bind_tf_idf(term, author, n)

author_tetragrams_tfidf <- train %>%
    unnest_tokens(term, text, token = 'ngrams', n=4) %>%
    count(author, term) %>%
    bind_tf_idf(term, author, n)

author_char_bigrams_tfidf <- train %>% 
    unnest_tokens(shingle, text, token = 'character_shingles', n=2, strip_non_alphanum = F) %>%
    count(author, shingle) %>%
    bind_tf_idf(shingle, author, n)

author_char_trigrams_tfidf <- train %>% 
    unnest_tokens(shingle, text, token = 'character_shingles', n=3, strip_non_alphanum = F) %>%
    count(author, shingle) %>%
    bind_tf_idf(shingle, author, n)

author_char_tetragrams_tfidf <- train %>% 
    unnest_tokens(shingle, text, token = 'character_shingles', n=4, strip_non_alphanum = F) %>%
    count(author, shingle) %>%
    bind_tf_idf(shingle, author, n)

author_char_pentagrams_tfidf <- train %>% 
    unnest_tokens(shingle, text, token = 'character_shingles', n=5, strip_non_alphanum = F) %>%
    count(author, shingle) %>%
    bind_tf_idf(shingle, author, n)
# Terms used by only one of the three authors (idf at its maximum, log(3))
# and occurring more than 25 times; column 2 holds the term/shingle itself
get_author_ngrams <- function(df, author_id){
    df %>% filter(idf == max(idf, na.rm = T) & author == author_id & n > 25) %>% .[,2]
}

df_ngram <- list(author_unigrams_tfidf, author_bigrams_tfidf, author_trigrams_tfidf, author_tetragrams_tfidf,
                author_char_bigrams_tfidf, author_char_trigrams_tfidf, author_char_tetragrams_tfidf,
                author_char_pentagrams_tfidf)

eap <- mapply(get_author_ngrams, df_ngram, rep('EAP', 8)) %>% unlist() %>% unique()
hpl <- mapply(get_author_ngrams, df_ngram, rep('HPL', 8)) %>% unlist() %>% unique()
mws <- mapply(get_author_ngrams, df_ngram, rep('MWS', 8)) %>% unlist() %>% unique()

get_total_ngrams <- function(df){
    map_df(1:4, ~ unnest_tokens(df, term, text, token = 'ngrams', n = .x)) %>%
    bind_rows(map_df(2:5, ~ unnest_tokens(df, term, text, token = 'character_shingles', n = .x,
                                          strip_non_alphanum = F))) %>%
    mutate(EAP_only_ind = as.integer(term %in% eap),
           HPL_only_ind = as.integer(term %in% hpl),
           MWS_only_ind = as.integer(term %in% mws)) %>%
    group_by(id) %>%
    summarise(EAP_only_count = sum(EAP_only_ind),
              HPL_only_count = sum(HPL_only_ind),
              MWS_only_count = sum(MWS_only_ind))
}

train_author_only <- get_total_ngrams(train)
test_author_only <- get_total_ngrams(test)

# Terms shared by exactly two of the three authors (idf == log(3/2), i.e. log(1.5))
# and occurring more than 50 times in total for that pair
get_author_pair_ngrams <- function(df, author_id1, author_id2){
    df %>% filter(author == author_id1 | author == author_id2, idf == log(1.5)) %>%
        group_by_at(2) %>% 
        count(wt = n) %>% filter(n > 50) %>% .[,1]
}
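These idf cut-offs work because tf-idf is computed over just three author-level "documents", so idf can only take three values: `idf == max(idf)` isolates terms unique to one author, and `idf == log(1.5)` isolates terms shared by exactly one pair. A quick base-R sketch of the three cases:

```r
# idf = log(number of documents / number of documents containing the term),
# with the three authors acting as the 'documents'
idf_one_author  <- log(3 / 1)  # unique to a single author (the maximum)
idf_two_authors <- log(3 / 2)  # shared by exactly two authors
idf_all_authors <- log(3 / 3)  # used by all three authors, i.e. zero

c(idf_one_author, idf_two_authors, idf_all_authors)
# 1.0986123 0.4054651 0.0000000
```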

eap_hpl <- mapply(get_author_pair_ngrams, df_ngram, rep('EAP', 8), rep('HPL', 8)) %>% unlist() %>% unique()
eap_mws <- mapply(get_author_pair_ngrams, df_ngram, rep('EAP', 8), rep('MWS', 8)) %>% unlist() %>% unique()
hpl_mws <- mapply(get_author_pair_ngrams, df_ngram, rep('HPL', 8), rep('MWS', 8)) %>% unlist() %>% unique()

get_total_pair_ngrams <- function(df){
    map_df(1:4, ~ unnest_tokens(df, term, text, token = 'ngrams', n = .x)) %>%
    bind_rows(map_df(2:5, ~ unnest_tokens(df, term, text, token = 'character_shingles', n = .x,
                                          strip_non_alphanum = F))) %>%
    mutate(EAP_HPL_only_ind = as.integer(term %in% eap_hpl),
           EAP_MWS_only_ind = as.integer(term %in% eap_mws),
           HPL_MWS_only_ind = as.integer(term %in% hpl_mws)) %>%
    group_by(id) %>%
    summarise(EAP_HPL_only_count = sum(EAP_HPL_only_ind),
              EAP_MWS_only_count = sum(EAP_MWS_only_ind),
              HPL_MWS_only_count = sum(HPL_MWS_only_ind))
}

train_author_pair_only <- get_total_pair_ngrams(train)
test_author_pair_only <- get_total_pair_ngrams(test)
train %>% 
    left_join(train_author_only, by = 'id') %>%
    left_join(train_author_pair_only, by = 'id') %>%
    select(-c(text, id, clean_text)) %>%
    group_by(author) %>%
    summarise_all(mean) %>%
    gather(feature, value, -author) %>%
    ggplot(aes(x = feature, y = value, color = author)) +
    scale_color_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    geom_point(alpha = 0.6, size = 3) +
    labs(x = 'feature', y = 'mean value by author') +
    coord_flip() +
    theme_bw()

4.5 Gender features

get_gender_plot <- function(m, f){
    train %>%
        unnest_tokens(word, text) %>%
        filter((word == m) | (word == f)) %>%
        mutate(word = as.factor(word)) %>%
        mutate(word = fct_relevel(word, f, m)) %>%
        ggplot(aes(word, fill = author)) +
        geom_bar(position = 'dodge') +
        scale_fill_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
        theme_bw()
}

pl1 <- get_gender_plot('man', 'woman')
pl2 <- get_gender_plot('he', 'she')
pl3 <- get_gender_plot('him', 'her')

pl4 <- train %>%
    unnest_tokens(word, text) %>%
    mutate(male = (word %in% c('he', 'him', 'his', 'male', 'man', 'gentleman', 'sir', 'lord', 'men'))) %>%
    mutate(female = (word %in% c('she', 'her', 'hers', 'female', 'woman', 'lady', 'madam', 'women' ))) %>%
    unite(sex, male, female) %>%
    mutate(sex = fct_recode(as.factor(sex), male = 'TRUE_FALSE', 
                          female = 'FALSE_TRUE', other = 'FALSE_FALSE')) %>%
    filter(sex != 'other') %>%
    ggplot(aes(sex, fill = author)) +
    labs(x = 'gender indicators') +
    geom_bar(position = 'dodge') +
    scale_fill_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    theme_bw()

grid.arrange(pl1, pl2, pl3, pl4, nrow = 2)

train_gender <- train %>%
    unnest_tokens(word, text) %>% 
    mutate(male = (word %in% c('he', 'him', 'his', 'male', 'man', 'gentleman', 'sir', 'lord', 'men'))) %>%
    mutate(female = (word %in% c('she', 'her', 'hers', 'female', 'woman', 'lady', 'madam', 'women' ))) %>% 
    group_by(id) %>%
    summarise(male = sum(male), female = sum(female))

test_gender <- test %>%
    unnest_tokens(word, text) %>% 
    mutate(male = (word %in% c('he', 'him', 'his', 'male', 'man', 'gentleman', 'sir', 'lord', 'men'))) %>%
    mutate(female = (word %in% c('she', 'her', 'hers', 'female', 'woman', 'lady', 'madam', 'women' ))) %>% 
    group_by(id) %>%
    summarise(male = sum(male), female = sum(female))

4.6 Alliterations and assonance

get_alliterations <- function(df, stop = F){
    if(stop){
        df <- df %>%
            select(id, text) %>% 
            unnest_tokens(word, text) %>% 
            anti_join(stop_words, by = 'word')
    }else{
        df <- df %>%
            select(id, text) %>% 
            unnest_tokens(word, text)
    }

    df %>% 
        mutate(first = str_sub(word, start = 1, end = 1),
             f_lead_1 = lead(first, n = 1),
             f_lead_2 = lead(first, n = 2),
             id_lead_1 = lead(id, n = 1),
             id_lead_2 = lead(id, n = 2),
             allit2 = first == f_lead_1 & id == id_lead_1,
             allit3 = first == f_lead_1 & first == f_lead_2 & id == id_lead_1 & id == id_lead_2
             ) %>%
        filter(!is.na(allit2)) %>%
        group_by(id, allit2) %>%
        count() %>%
        spread(allit2, n) %>%
        mutate(has_allit = ifelse(!is.na(`TRUE`), 1, 0)) %>%
        select(id, has_allit)
}
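To illustrate the `lead()`-based pairing inside `get_alliterations`, here is a toy four-word "sentence" (hypothetical data, packages as loaded in section 2.1):

```r
library(dplyr)
library(stringr)

# three of the four words share the initial letter 'd'
demo <- tibble(id = 'id00000', word = c('deep', 'dark', 'dank', 'pit')) %>%
    mutate(first  = str_sub(word, 1, 1),
           allit2 = first == lead(first) & id == lead(id))
demo$allit2
# TRUE TRUE FALSE NA: 'deep'/'dark' and 'dark'/'dank' alliterate; the last
# word has no successor, so lead() yields NA (filtered out in the function)
```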

train_alliterations <- get_alliterations(train, stop = T)
test_alliterations <- get_alliterations(test, stop = T)

get_allit_plot <- function(df, title){
    df %>% 
        left_join(train, by = 'id') %>% 
        group_by(author, has_allit) %>% 
        count() %>% 
        ungroup() %>% 
        (function(df) left_join(df, df %>% group_by(author) %>% summarise(s = sum(n)), by = 'author')) %>% 
        mutate(n = (n/s) * 100) %>% 
        filter(has_allit == T) %>% 
        ggplot(aes(x = author, y = n, fill = author)) +
        geom_col() +
        scale_fill_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
        labs(y = "percentage of sentences with alliteration") +
        ggtitle(title) +
        theme_bw()
}

pl1 <- get_allit_plot(get_alliterations(train, stop = T), title = 'without stop words')
pl2 <- get_allit_plot(get_alliterations(train, stop = F), title = 'with stop words')

grid.arrange(pl1, pl2, nrow = 1)

5 Modeling

5.1 Quality measurement

MultiLogLoss <- function(y_pred, y_true) {
    if (is.matrix(y_true) == FALSE) {
        y_true <- model.matrix(~ 0 + ., data.frame(as.character(y_true)))
    }
    y_pred <- as.matrix(y_pred)
    eps <- 1e-15
    N <- nrow(y_pred)
    y_pred <- pmax(pmin(y_pred, 1 - eps), eps)
    MultiLogLoss <- (-1 / N) * sum(y_true * log(y_pred))
    return(MultiLogLoss)
}
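As a sanity check, uniform predictions over the three classes should score exactly -log(1/3) ≈ 1.0986 regardless of the true labels. The same computation as `MultiLogLoss`, self-contained on toy labels (not competition data):

```r
eps    <- 1e-15
y_true <- diag(3)[c(1, 2, 3, 1), ]           # one-hot labels for four toy rows
y_pred <- matrix(1/3, nrow = 4, ncol = 3)    # uniform 'no information' prediction
ll <- (-1 / nrow(y_pred)) * sum(y_true * log(pmax(pmin(y_pred, 1 - eps), eps)))
ll
# 1.098612, i.e. -log(1/3): the score of the no-information baseline
```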

5.2 All features

train_total <- train %>% 
    select(-c(text, clean_text)) %>% 
    left_join(train_stylometry, by = 'id') %>% 
    left_join(train_senti, by = 'id') %>% 
    left_join(train_udp_pos_count, by = 'id') %>% 
    left_join(train_udp_upos_count, by = 'id') %>% 
    left_join(train_gender, by = 'id') %>% 
    left_join(train_author_only, by = 'id') %>% 
    left_join(train_author_pair_only, by = 'id') %>% 
    left_join(train_alliterations, by = 'id') 
train_total[is.na(train_total)] <- 0

test_total <- test %>% 
    select(-c(text, clean_text)) %>% 
    left_join(test_stylometry, by = 'id') %>% 
    left_join(test_senti, by = 'id') %>% 
    left_join(test_udp_pos_count, by = 'id') %>% 
    left_join(test_udp_upos_count, by = 'id') %>% 
    left_join(test_gender, by = 'id') %>% 
    left_join(test_author_only, by = 'id') %>% 
    left_join(test_author_pair_only, by = 'id') %>% 
    left_join(test_alliterations, by = 'id')
test_total[is.na(test_total)] <- 0

5.3 Naive Bayes classifier

set.seed(42)
folds <- createFolds(train$author, k = 5)
# to_categorical indexes classes from 0 while factor codes start at 1,
# so the first column is all zeros and is dropped
y_train <- to_categorical(as.integer(as.factor(train$author)), num_classes = NULL)
## Loaded Tensorflow version 2.7.0
y_train <- y_train[,2:4]
df <- train %>%
    select(-c(author)) %>% 
    unnest_tokens(token, text, token = 'words', to_lower = F) %>% 
    select(-id) %>%
    count(token) %>%
    .$token
    
df_cnt <- train %>%
    select(-c(author)) %>%
    bind_rows(test) %>%
    unnest_tokens(token, text, token = 'words', to_lower = F) %>%
    count(id, token) %>%
    filter(token %in% df) %>%
    cast_dtm(id, token, n)
    # cast_dtm(id, token, n, weighting = tm::weightTfIdf)

x_train_df <-  df_cnt[train$id,] %>% as.matrix
x_test_df <- df_cnt[test$id,] %>% as.matrix
rm(df, df_cnt)
    
model_fnb <- fnb.train(x_train_df, train$author, laplace = 1)
# MultiLogLoss(y_pred = predict(model_fnb, x_train_df, type = 'raw'), y_true = y_train)

fnb <- predict(model_fnb, x_test_df, type = 'raw')
read_csv(file = 'data/sample_submission.csv') %>% 
    select(id) %>% 
    bind_cols(., fnb %>% as_tibble()) %>% 
    write_csv('submissions/sub_fnb.csv')

5.4 Neural Networks

nn_model <- keras_model_sequential() %>%
    layer_dense(units = 16, activation = 'relu', input_shape = 19579) %>% 
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dense(units = 3, activation = 'softmax')

nn_model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = 'rmsprop',
    metrics = c('accuracy')
)
nn_model
## Model
## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_2 (Dense)                    (None, 16)                      313280      
##                                                                                 
##  dense_1 (Dense)                    (None, 16)                      272         
##                                                                                 
##  dense (Dense)                      (None, 3)                       51          
##                                                                                 
## ================================================================================
## Total params: 313,603
## Trainable params: 313,603
## Non-trainable params: 0
## ________________________________________________________________________________

5.4.1 N-grams

get_ngram_df <- function(in_token, in_n, in_cnt){
    df <- train %>%
        select(-author) %>% 
        unnest_tokens(token, text, token = in_token, n = in_n, to_lower = F) %>% 
        select(-id) %>%
        count(token) %>%
        filter(n > in_cnt) %>%
        .$token
    
    df_cnt <- train %>%
        select(-author) %>%
        bind_rows(test) %>%
        unnest_tokens(token, text, token = in_token, n = in_n, to_lower = F) %>%
        count(id, token) %>%
        filter(token %in% df) %>%
        cast_dtm(id, token, n)
    
    # Documents whose every token was removed by the count filter drop out of
    # the DTM; restore them as all-zero rows so indexing by id below succeeds
    if(in_cnt != 0){
        rowsRemoved <- setdiff(c(train$id,test$id),rownames(df_cnt))
        allZeros <- matrix(0, length(rowsRemoved), ncol(df_cnt), 
                           dimnames = list(rowsRemoved, colnames(df_cnt)))
        df_cnt <- df_cnt %>% rbind(allZeros)
        rm(rowsRemoved,allZeros)
    }
    
    x_train_df <- df_cnt[train$id,] %>% as.matrix
    x_test_df <- df_cnt[test$id,] %>% as.matrix
    rm(df, df_cnt)
        
    return(list(x_train_df, x_test_df))
}

get_ngram_model <- function(df_train, df_test, fold, y_train, filepath){
    nn_model <- keras_model_sequential() %>%
        layer_dense(units = 16, activation = 'relu', input_shape = ncol(df_train)) %>% 
        layer_dense(units = 16, activation = 'relu') %>%
        layer_dense(units = 3, activation = 'softmax')
    
    nn_model %>% compile(
        loss = 'categorical_crossentropy',
        optimizer = 'rmsprop',
        metrics = c('accuracy')
    )
    
    frozen_nn_model <- nn_model %>%
        fit(
            df_train[-fold,], y_train[-fold,],
            batch_size = 2^9,
            epochs = 20,
            validation_split = 0.1,
            verbose = F,
            callbacks = list(
                callback_early_stopping(monitor = 'val_loss', patience = 2),
                callback_model_checkpoint(
                    filepath = paste0(filepath, '.hdf5'),
                    monitor = 'val_loss',
                    mode = 'min',
                    save_best_only = T)
          )
    )
    nn_model <- load_model_hdf5(paste0(filepath, '.hdf5'))
    
    train_pred <- nn_model %>%
        predict(df_train[fold,])
    test_pred <- nn_model %>%
        predict(df_test)
    fold_eval <- nn_model %>%
        evaluate(df_train[fold,], y_train[fold,])
    out <- list(train_pred = train_pred, test_pred = test_pred, logloss = fold_eval[["loss"]], acc = fold_eval[["accuracy"]] )
    k_clear_session()
    return(out)
}

get_ngram_predictions <- function(ngram_df, filepath){
    train_count <- matrix(0, nrow = nrow(train), ncol = 3)
    test_count <- matrix(0, nrow = nrow(test), ncol = 3)
    metrics_count <- matrix(0, 5, 2)

    for(i in 1:5){
        results_count <- get_ngram_model(ngram_df[[1]], ngram_df[[2]], folds[[i]], y_train, filepath)
        train_count[folds[[i]], ] <- results_count$train_pred
        test_count <- test_count + (results_count$test_pred)/5
        metrics_count[i,1] <- results_count$logloss
        metrics_count[i,2] <- results_count$acc
        gc()
    }   
  
    train_count <- train_count %>%
        as.data.frame()
    test_count <- test_count %>%
        as.data.frame()
    metrics_count <- metrics_count %>%
        as.data.frame()
    rownames(metrics_count) <- paste0("fold ", 1:5, ":")
    return(list(train_count, test_count, metrics_count))
}

5.4.1.1 Unigrams

ngram_df <- get_ngram_df(in_token = 'ngrams', in_n = 1, in_cnt = 0)
ngram_predictions <- get_ngram_predictions(ngram_df, 'word_1g_count')
train_word_1g_count <- ngram_predictions[[1]] %>%  
    rename(word_1g_count_EAP=V1, word_1g_count_HPL=V2, word_1g_count_MWS=V3)
test_word_1g_count <- ngram_predictions[[2]] %>% 
    rename(word_1g_count_EAP=V1, word_1g_count_HPL=V2, word_1g_count_MWS=V3)
metrics_word_1g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_word_1g_count
logloss acc
fold 1: 0.3703454 0.8595147
fold 2: 0.3955860 0.8409091
fold 3: 0.3913905 0.8414198
fold 4: 0.4029367 0.8429520
fold 5: 0.4126425 0.8429520

5.4.1.2 Bigrams

ngram_df <- get_ngram_df(in_token = 'ngrams', in_n = 2, in_cnt = 2)
ngram_predictions <- get_ngram_predictions(ngram_df, 'word_2g_count')
train_word_2g_count <- ngram_predictions[[1]] %>%  
    rename(word_2g_count_EAP=V1, word_2g_count_HPL=V2, word_2g_count_MWS=V3)
test_word_2g_count <- ngram_predictions[[2]] %>% 
    rename(word_2g_count_EAP=V1, word_2g_count_HPL=V2, word_2g_count_MWS=V3)
metrics_word_2g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_word_2g_count
logloss acc
fold 1: 0.5756299 0.7652618
fold 2: 0.5688075 0.7614913
fold 3: 0.5739298 0.7643003
fold 4: 0.5635101 0.7778345
fold 5: 0.6054180 0.7492339

5.4.1.3 Trigrams

ngram_df <- get_ngram_df(in_token = 'ngrams', in_n = 3, in_cnt = 2)
ngram_predictions <- get_ngram_predictions(ngram_df, 'word_3g_count')
train_word_3g_count <- ngram_predictions[[1]] %>%  
    rename(word_3g_count_EAP=V1, word_3g_count_HPL=V2, word_3g_count_MWS=V3)
test_word_3g_count <- ngram_predictions[[2]] %>% 
    rename(word_3g_count_EAP=V1, word_3g_count_HPL=V2, word_3g_count_MWS=V3)
metrics_word_3g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_word_3g_count
logloss acc
fold 1: 0.8679376 0.6053640
fold 2: 0.8813269 0.6016343
fold 3: 0.8753681 0.5911645
fold 4: 0.8750075 0.5929520
fold 5: 0.9087745 0.5891216

5.4.1.4 4-grams

ngram_df <- get_ngram_df(in_token = 'ngrams', in_n = 4, in_cnt = 2)
ngram_predictions <- get_ngram_predictions(ngram_df, 'word_4g_count')
train_word_4g_count <- ngram_predictions[[1]] %>%  
    rename(word_4g_count_EAP=V1, word_4g_count_HPL=V2, word_4g_count_MWS=V3)
test_word_4g_count <- ngram_predictions[[2]] %>% 
    rename(word_4g_count_EAP=V1, word_4g_count_HPL=V2, word_4g_count_MWS=V3)
metrics_word_4g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_word_4g_count
logloss acc
fold 1: 1.050763 0.4319285
fold 2: 1.042072 0.4364147
fold 3: 1.040633 0.4371808
fold 4: 1.038590 0.4389683
fold 5: 1.034376 0.4486721

5.4.2 Character-based N-grams

get_char_ngram_df <- function(in_token, in_n, in_cnt){
  
  if(in_n <= 3){
    
      df <- map_df(1:in_n, ~ unnest_tokens(train %>% select(-author), token, text, 
                                        token = in_token, n = .x, to_lower = F, 
                                        lowercase = F, strip_non_alphanum = F)) %>%
          select(-id) %>%
          count(token) %>%
          filter(n > in_cnt) %>% 
          .$token

      df_cnt <- map_df(1:in_n, ~ unnest_tokens(train %>% select(-author) %>% bind_rows(test), token, text, 
                                                 token = in_token, n = .x, to_lower = F, 
                                                 lowercase = F, strip_non_alphanum = F)) %>%
          count(id, token) %>%
          filter(token %in% df) %>%
          cast_dtm(id, token, n) %>%
          as.matrix()
  }else{
  
      df <- train %>%
          select(-author) %>% 
          unnest_tokens(token, text, token = in_token, n = in_n, to_lower = F, lowercase = F, strip_non_alphanum = F) %>% 
          select(-id) %>%
          count(token) %>%
          filter(n > in_cnt) %>%
          .$token
      
      df_cnt <- train %>%
          select(-author) %>%
          bind_rows(test) %>%
          unnest_tokens(token, text, token = in_token, n = in_n, to_lower = F, lowercase = F, strip_non_alphanum = F) %>%
          count(id, token) %>%
          filter(token %in% df) %>%
          cast_dtm(id, token, n)
  }
  
    if(in_cnt != 0){
        rowsRemoved <- setdiff(c(train$id,test$id),rownames(df_cnt))
        allZeros <- matrix(0, length(rowsRemoved), ncol(df_cnt), 
                           dimnames = list(rowsRemoved, colnames(df_cnt)))
        df_cnt <- df_cnt %>% rbind(allZeros)
        rm(rowsRemoved,allZeros)
    }
    
    x_train_df <- df_cnt[train$id,] %>% as.matrix
    x_test_df <- df_cnt[test$id,] %>% as.matrix
    rm(df, df_cnt)
        
    return(list(x_train_df, x_test_df))
}

5.4.2.1 N <= 3

ngram_df <- get_char_ngram_df(in_token = 'character_shingles', in_n = 3, in_cnt = 0)
ngram_predictions <- get_ngram_predictions(ngram_df, 'char_3g_count')
train_char_3g_count <- ngram_predictions[[1]] %>%  
    rename(char_3g_count_EAP=V1, char_3g_count_HPL=V2, char_3g_count_MWS=V3)
test_char_3g_count <- ngram_predictions[[2]] %>% 
    rename(char_3g_count_EAP=V1, char_3g_count_HPL=V2, char_3g_count_MWS=V3)
metrics_char_3g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_char_3g_count
logloss acc
fold 1: 0.5673450 0.7803320
fold 2: 0.4941036 0.7929009
fold 3: 0.5223595 0.7890705
fold 4: 0.5462490 0.7870276
fold 5: 0.5255454 0.7926456

5.4.2.2 N = 4

ngram_df <- get_char_ngram_df(in_token = 'character_shingles', in_n = 4, in_cnt = 2)
ngram_predictions <- get_ngram_predictions(ngram_df, 'char_4g_count')
train_char_4g_count <- ngram_predictions[[1]] %>%  
    rename(char_4g_count_EAP=V1, char_4g_count_HPL=V2, char_4g_count_MWS=V3)
test_char_4g_count <- ngram_predictions[[2]] %>% 
    rename(char_4g_count_EAP=V1, char_4g_count_HPL=V2, char_4g_count_MWS=V3)
metrics_char_4g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_char_4g_count
logloss acc
fold 1: 0.3967863 0.8487867
fold 2: 0.4362471 0.8329929
fold 3: 0.4243933 0.8340143
fold 4: 0.4073804 0.8432074
fold 5: 0.4256585 0.8363126

5.4.2.3 N = 5

ngram_df <- get_char_ngram_df(in_token = 'character_shingles', in_n = 5, in_cnt = 2)
ngram_predictions <- get_ngram_predictions(ngram_df, 'char_5g_count')
train_char_5g_count <- ngram_predictions[[1]] %>%  
    rename(char_5g_count_EAP=V1, char_5g_count_HPL=V2, char_5g_count_MWS=V3)
test_char_5g_count <- ngram_predictions[[2]] %>% 
    rename(char_5g_count_EAP=V1, char_5g_count_HPL=V2, char_5g_count_MWS=V3)
metrics_char_5g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_char_5g_count
logloss acc
fold 1: 0.4010104 0.8498085
fold 2: 0.4001806 0.8393769
fold 3: 0.4102779 0.8398876
fold 4: 0.4074607 0.8472931
fold 5: 0.4101155 0.8439735

5.4.2.4 N = 6

ngram_df <- get_char_ngram_df(in_token = 'character_shingles', in_n = 6, in_cnt = 2)
ngram_predictions <- get_ngram_predictions(ngram_df, 'char_6g_count')
train_char_6g_count <- ngram_predictions[[1]] %>%  
    rename(char_6g_count_EAP=V1, char_6g_count_HPL=V2, char_6g_count_MWS=V3)
test_char_6g_count <- ngram_predictions[[2]] %>% 
    rename(char_6g_count_EAP=V1, char_6g_count_HPL=V2, char_6g_count_MWS=V3)
metrics_char_6g_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(ngram_df, ngram_predictions)
invisible(gc())
metrics_char_6g_count
logloss acc
fold 1: 0.4114417 0.8416348
fold 2: 0.4198805 0.8373340
fold 3: 0.4263317 0.8324821
fold 4: 0.4101219 0.8455056
fold 5: 0.4218861 0.8432074

5.4.3 POS-tagged NN

# '$' in tags like PRP$ and WP$ is replaced with 'S' so the tag strings
# survive the n-gram tokenization below
train_tag <- train_pos_tagged %>% 
    mutate(pos = str_replace_all(.$xpos, '\\$', 'S')) %>%
    group_by(doc_id) %>%
    summarise(text_tag = str_c(pos, collapse = ' '))

test_tag <- test_pos_tagged %>% 
    mutate(pos = str_replace_all(.$xpos, '\\$', 'S')) %>%
    group_by(doc_id) %>%
    summarise(text_tag = str_c(pos, collapse = ' '))

df <- map_df(1:4, ~ unnest_tokens(train_tag, token, text_tag, 
                                        token = 'ngrams', to_lower = F, n = .x)) %>%
    count(token) %>%
    filter(n > 2) %>%
    .$token

df_cnt <- map_df(1:4, ~ unnest_tokens(train_tag %>% bind_rows(test_tag), token, 
                                              text_tag, token = 'ngrams', to_lower = F, n = .x)) %>%
    rename(id = doc_id) %>% 
    count(id, token) %>%
    filter(token %in% df) %>%
    cast_dtm(id, token, n) %>%
    as.matrix()

x_train_df <- df_cnt[train$id,] %>% as.matrix
x_test_df <- df_cnt[test$id,] %>% as.matrix
rm(df, df_cnt)

ngram_predictions <- get_ngram_predictions(list(x_train_df, x_test_df), 'pos_count')
train_pos_count <- ngram_predictions[[1]] %>%  
    rename(pos_count_EAP=V1, pos_count_HPL=V2, pos_count_MWS=V3)
test_pos_count <- ngram_predictions[[2]] %>% 
    rename(pos_count_EAP=V1, pos_count_HPL=V2, pos_count_MWS=V3)
metrics_pos_count <- ngram_predictions[[3]] %>% 
    rename(logloss=V1, acc=V2)
rm(x_train_df, x_test_df, ngram_predictions)
invisible(gc())
metrics_pos_count
logloss acc
fold 1: 0.8632275 0.6084291
fold 2: 0.8573496 0.6080184
fold 3: 0.8700902 0.6016343
fold 4: 0.8893613 0.5985700
fold 5: 0.8655088 0.6133810

5.5 Embedding-based NN

5.5.1 Custom Embeddings

max_features <- 10000

tokenizer <- text_tokenizer(num_words = max_features, lower = T) %>%
    fit_text_tokenizer(train$text)  
train_seq <- texts_to_sequences(tokenizer, train$text)
test_seq <- texts_to_sequences(tokenizer, test$text)
max_len <- 60
train_seq_padded <- pad_sequences(train_seq, maxlen = max_len)
test_seq_padded <- pad_sequences(test_seq, maxlen = max_len)

5.5.2 LSTM

set.seed(42)

model <- keras_model_sequential() %>% 
    layer_embedding(input_dim = max_features + 1, output_dim = 45, input_length = max_len) %>%
    layer_dropout(rate = 0.2) %>%
    layer_lstm(units = 45, recurrent_dropout = 0.2, dropout = 0.2) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 3, activation = 'softmax') %>% 
    compile(
        # three-class softmax output, so use the multi-class cross-entropy
        loss = 'categorical_crossentropy',
        optimizer = 'rmsprop',
        metrics = 'accuracy'
        )
    
    
lstm <- model %>%
    fit(
      train_seq_padded, y_train,
      batch_size = 2^9, 
      epochs = 20,
      validation_split = 0.1,
      verbose = T,
      callbacks = list(
          callback_early_stopping(monitor = 'val_loss', patience = 2),
          callback_model_checkpoint(
              filepath = 'word_vec_lstm.hdf5',
              monitor = 'val_loss',
              save_best_only = TRUE)
          )
      )

lstm_m <- load_model_hdf5('word_vec_lstm.hdf5')

test_pred_lstm <- lstm_m %>%
    predict(test_seq_padded) %>%
    as.data.frame() %>% 
    rename(lstm_EAP=V1, lstm_HPL=V2, lstm_MWS=V3)

train_pred_lstm <- lstm_m %>%
    predict(train_seq_padded) %>%
    as.data.frame() %>% 
    rename(lstm_EAP=V1, lstm_HPL=V2, lstm_MWS=V3)
    
fold_eval <- lstm_m %>% evaluate(train_seq_padded, y_train)

read_csv(file = 'data/sample_submission.csv') %>% 
    select(id) %>% 
    bind_cols(., test_pred_lstm %>% as_tibble()) %>% 
    rename(EAP = lstm_EAP, HPL = lstm_HPL, MWS = lstm_MWS) %>% 
    write_csv('submissions/sub_lstm.csv')

plot(lstm)

5.5.3 BI-LSTM

set.seed(42)

model <- keras_model_sequential() %>% 
    layer_embedding(input_dim = max_features, output_dim = 45, input_length = max_len) %>%
    bidirectional(layer_lstm(units = 45, recurrent_dropout = 0.2, return_sequences = T)) %>%
    layer_global_max_pooling_1d() %>% 
    layer_dropout(rate = 0.2) %>% 
    layer_dense(units = 3, activation = 'softmax') %>% 
    compile(
        loss = "categorical_crossentropy",
        optimizer = "rmsprop",
        metrics = "accuracy"
        )
    
bilstm <- model %>%
    fit(
      train_seq_padded, y_train,
      batch_size = 2^9, 
      epochs = 20,
      validation_split = 0.1,
      verbose = T,
      callbacks = list(
          callback_early_stopping(monitor = 'val_loss', patience = 2),
          callback_model_checkpoint(
              filepath = 'word_vec_bilstm.hdf5',
              monitor = 'val_loss',
              save_best_only = TRUE)
          )
      )

bilstm_m <- load_model_hdf5('word_vec_bilstm.hdf5')

test_pred_bilstm <- bilstm_m %>%
    predict(test_seq_padded) %>%
    as.data.frame() %>% 
    rename(bi_lstm_EAP=V1, bi_lstm_HPL=V2, bi_lstm_MWS=V3)

train_pred_bilstm <- bilstm_m %>%
    predict(train_seq_padded) %>%
    as.data.frame() %>% 
    rename(bi_lstm_EAP=V1, bi_lstm_HPL=V2, bi_lstm_MWS=V3)


fold_eval <- bilstm_m %>% evaluate(train_seq_padded, y_train)

read_csv(file = 'data/sample_submission.csv') %>% 
    select(id) %>% 
    bind_cols(., test_pred_bilstm %>% as_tibble()) %>% 
    rename(EAP = bi_lstm_EAP, HPL = bi_lstm_HPL, MWS = bi_lstm_MWS) %>% 
    write_csv('submissions/sub_bilstm.csv')

plot(bilstm)

5.6 Ensemble

train_all_sp <- train %>% select(id, author) %>% 
    bind_cols(train_word_1g_count) %>%
    bind_cols(train_word_2g_count) %>%
    bind_cols(train_word_3g_count) %>%
    bind_cols(train_word_4g_count) %>% 
    bind_cols(train_char_3g_count) %>% 
    bind_cols(train_char_4g_count) %>% 
    bind_cols(train_char_5g_count) %>% 
    bind_cols(train_char_6g_count) %>% 
    bind_cols(train_pos_count) %>% 
    bind_cols(train_pred_lstm) %>%   
    bind_cols(train_pred_bilstm)


test_all_sp <- test %>% select(id) %>% 
    bind_cols(test_word_1g_count) %>%
    bind_cols(test_word_2g_count) %>%
    bind_cols(test_word_3g_count) %>%
    bind_cols(test_word_4g_count) %>% 
    bind_cols(test_char_3g_count) %>% 
    bind_cols(test_char_4g_count) %>% 
    bind_cols(test_char_5g_count) %>% 
    bind_cols(test_char_6g_count) %>% 
    bind_cols(test_pos_count) %>% 
    bind_cols(test_pred_lstm) %>%   
    bind_cols(test_pred_bilstm)

train_all_sp %>%
    select(-c(id)) %>%
    group_by(author) %>%
    summarise_all(mean) %>%
    gather(feature, value, -author) %>%
    ggplot(aes(x = feature, y = value, color = author, fill = author, size = value)) +
    scale_color_manual(values = c(HPL = 'blue4', EAP = 'red4', MWS = 'purple4')) +
    geom_point(alpha = 0.6, size = 3) +
    labs(x = 'feature', y = 'mean value by author') +
    coord_flip() +
    theme_bw()

dtrain <- train_total %>% 
    left_join(train_all_sp, by = c('id', 'author'))
dtrain[is.na(dtrain)] <- 0

dtest <- test_total %>% 
    left_join(test_all_sp, by = 'id')
dtest[is.na(dtest)] <- 0
dim(dtrain)
## [1] 19579   137
dim(dtest)
## [1] 8392  137
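The stacked feature tables above pass each network's class probabilities to XGBoost as meta-features. As a cheaper baseline, the two networks' outputs could also be blended directly by a weighted average; a minimal base-R sketch with made-up probabilities (the numbers and the equal weights are illustrative, not taken from this notebook):

```r
# Hypothetical class probabilities for one sentence from each network
lstm_probs   <- c(EAP = 0.70, HPL = 0.20, MWS = 0.10)
bilstm_probs <- c(EAP = 0.60, HPL = 0.30, MWS = 0.10)

# Equal-weight blend; in practice the weights would be tuned on validation log loss
blend <- 0.5 * lstm_probs + 0.5 * bilstm_probs
blend
##  EAP  HPL  MWS 
## 0.65 0.25 0.10
```

Stacking with XGBoost, as done here, lets the meta-learner weight each model per feature context instead of using one global weight.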

5.6.1 NN features correlation

sCorr <- cor(dtrain %>% select(-id, -author), method = 'spearman')
sigCorr <- findCorrelation(sCorr, cutoff = .8, names = TRUE)
ggcorrplot(cor(dtrain %>% select(all_of(sigCorr)), method = 'spearman'))
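`findCorrelation` flags, from each highly correlated column pair, the member whose removal most reduces overall pairwise correlation. A small self-contained sketch of the underlying idea on synthetic data (not the notebook's features; column names are made up):

```r
set.seed(42)
x <- rnorm(200)
# 'b' is a near-duplicate of 'a'; 'c' is independent noise
toy <- data.frame(a = x, b = x + rnorm(200, sd = 0.05), c = rnorm(200))

# 'a' and 'b' correlate almost perfectly, so cutoff = .8 would flag one of them
round(cor(toy, method = 'spearman'), 2)
```

The n-gram count features behave like `a` and `b` here: counts at adjacent n-gram lengths move together, which is why several of them exceed the 0.8 cutoff.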

5.6.2 XGBoost

# combine caret's multi-class log loss and accuracy into one resampling summary
customSummary <- function(data, lev = levels(data$obs), model = NULL) {
    mcs <- multiClassSummary(data, lev = lev, model = model)
    mnll <- mnLogLoss(data, lev = lev, model = model)
    c(mnll, mcs['Accuracy'])
}
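The summary function concatenates caret's `mnLogLoss` and `Accuracy`. Multi-class log loss itself is just the mean negative log of the probability assigned to the true class; a hand computation on toy numbers (not from this model):

```r
# Two toy predictions: rows are sentences, columns are class probabilities
probs <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.6, 0.2),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c('EAP', 'HPL', 'MWS')))
truth <- c('EAP', 'HPL')

# Pick each row's probability for its true class, then average -log(p)
p_true  <- probs[cbind(seq_along(truth), match(truth, colnames(probs)))]
logloss <- -mean(log(p_true))
round(logloss, 4)  # -(log(0.8) + log(0.6)) / 2
## [1] 0.367
```

This is the metric the competition scores on, which is why `logLoss` rather than accuracy drives model selection below.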


set.seed(42)
xgbt <- train(x = dtrain %>% select(-c(id, author)), 
                  y = factor(dtrain$author),
                  method = 'xgbTree',
                  metric = 'logLoss',
                  tuneGrid = expand.grid(nrounds = seq(100,1400,100),
                                         max_depth = 4, 
                                         eta = 0.02,
                                         gamma = 0.5,
                                         colsample_bytree = 0.35,
                                         min_child_weight = 4,
                                         subsample = 0.85),
                  trControl = trainControl(method = "cv",
                                           number = 10,
                                           classProbs = TRUE,
                                           summaryFunction = customSummary,
                                           search = 'grid')
)

5.7 Final submission

xgbt
## eXtreme Gradient Boosting 
## 
## 19579 samples
##   135 predictor
##     3 classes: 'EAP', 'HPL', 'MWS' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17621, 17620, 17621, 17621, 17622, 17622, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  logLoss    Accuracy 
##    100     0.2812983  0.9395277
##    200     0.1914188  0.9410088
##    300     0.1747006  0.9423876
##    400     0.1679020  0.9438177
##    500     0.1640650  0.9456054
##    600     0.1617971  0.9461161
##    700     0.1600828  0.9464736
##    800     0.1591452  0.9467288
##    900     0.1584492  0.9469330
##   1000     0.1581400  0.9473416
##   1100     0.1579513  0.9473416
##   1200     0.1579438  0.9470353
##   1300     0.1579343  0.9470864
##   1400     0.1581504  0.9471884
## 
## Tuning parameter 'max_depth' was held constant at a value of 4
## Tuning parameter 'eta' was held constant at a value of 0.02
## Tuning parameter 'gamma' was held constant at a value of 0.5
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.35
## Tuning parameter 'min_child_weight' was held constant at a value of 4
## Tuning parameter 'subsample' was held constant at a value of 0.85
## logLoss was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 1300, max_depth = 4, eta
##  = 0.02, gamma = 0.5, colsample_bytree = 0.35, min_child_weight = 4
##  and subsample = 0.85.
sub_xgbt <- read_csv('data/sample_submission.csv') %>%
    select(id) %>%
    bind_cols(predict(xgbt, dtest %>% select(-c(id)), type = 'prob'))
write_excel_csv(sub_xgbt, 'submissions/stacked_xgb_short.csv')